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ABSTRACT 

The awareness of important biological role played 
by functional, non coding (nc) RNA has grown tre- 
mendously in recent years. To perform their tasks, 
ncRNA molecules typically unite with protein 
partners, forming ribonucleoprotein complexes. 
Structural insight into their architectures can be 
greatly supplemented by computational docking 
techniques, as they provide means for the integra- 
tion and refinement of experimental data that is 
often limited to fragments of larger assemblies or 
represents multiple levels of spatial resolution. 
Here, we present a coarse-grained force field for 
protein-RNA docking, implemented within the 
framework of the ATTRACT program. Complex 
structure prediction is based on energy minimiza- 
tion in rotational and translational degrees of 
freedom of binding partners, with possible exten- 
sion to include structural flexibility. The coarse- 
grained representation allows for fast and efficient 
systematic docking search without any prior know- 
ledge about complex geometry. 



INTRODUCTION 

In recent years, the known inventory of functional RNAs 
that do not belong to protein coding messenger (m) RNA 
class has increased dramatically (1-3). These so-called 
non-coding (nc) RNAs appear to play important roles in 
diverse cellular activities such as RNA processing and 
modification, translation, gene expression, protein 
trafficking or chromosome maintenance. In doing so, 
RNA molecules typically unite with protein partners in 
functional ribonucleoprotein complexes (4-6). Structural 
insight into such assemblies is essential for our under- 
standing of their mechanism of action as well as for the 
future ability to design new diagnostic tools or therapeutic 
strategies. 



The awareness for the importance of ncRNA grows 
much faster, however, than the available body of struc- 
tural data. In spite of spectacular successes, such as 
obtaining high-resolution structures of small and large 
ribosomal subunits (7,8), protein-RNA complexes 
comprise currently only ~4% of records deposited in the 
Protein Data Bank (PDB) (9), while at the same time it is 
estimated that the portion of genome transcribed into 
ncRNA may be even 20 times larger than the protein 
coding part (3). X ray crystallography of macromolecular 
complexes, particularly containing nucleic acids, is a more 
difficult task than the determination of isolated compo- 
nents. Thus, computational docking techniques, aiming at 
the prediction of the complex structure based on its com- 
ponents, are becoming increasingly important. Even 
though they still need to tackle many challenges before 
becoming a reliable standalone tool (10), they already 
play an important role in the integration of structural 
data coming from experimental methods that provide dif- 
ferent levels of spatial resolution (11,12). 

To date, most computational efforts for ribonucleo- 
proteins were focused on the characterization of binding 
interfaces (13-17) or the localization of RNA binding sites 
on proteins (18-23) rather than docking. In contrast to 
protein-protein docking that has an established record 
(10), to our knowledge, only very few preliminary 
attempts have been made for the actual protein-RNA 
binding mode prediction (24-26). They are based on 
distance-dependent atomic statistical potentials for 
protein-RNA interactions (24,25) or statistically derived 
propensities for protein-RNA pairing at the residue level 
(26). Descriptors developed for quantifying protein-RNA 
interactions were demonstrated to distinguish between 
native structures and provided decoys (24,25), or to 
improve ranking of near-native solutions obtained using 
shape complementarity-based scoring (26). 

Recently, protein-RNA complexes were included as 
targets in the Critical Assessment of PRediction of 
Interactions (CAPRI) competition (27). Most of the 
participating groups adapted for this occasion methods 
developed for protein-protein docking and used 
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experimental restraints to facilitate native-like geometry 
selection. Out of two ribonucleoprotein targets, one 
required homology modeling of both binding partners 
and was not solved by any of the competing groups, 
while the second, with provided bound RNA geometry, 
allowed for a number of successful, medium accuracy pre- 
dictions. The presence of protein-RNA targets for the first 
time in CAPRI competition indicates the growing interest 
in computational prediction of ribonucleoprotein struc- 
tures and consequently, the need for new methods, 
targeted specifically on protein-RNA docking problems. 

In the current study, we design and parametrize a 
distance-dependent, coarse-grained forcefield for 
protein-RNA interactions. This forcefield is compatible 
with an earlier parameter set developed for protein- 
protein docking by Zacharias (28,29). It allows for fully 
systematic protein-RNA docking by energy minimization 
in the rotational and translational degrees of freedom of 
the binding partners. Unlike propensity-based descriptors, 
the distance-dependent potential function provides all 
necessary means for finding realistic bound geometries 
and their subsequent scoring by the resulting potential 
energy. Structure representation at the sub residual 
coarse-grained level allows for efficient calculations, yet 
at the same time maintains reasonable details of 
physicochemical features. 

In the following sections, we describe forcefield devel- 
opment and testing based on 110 crystallographic struc- 
tures of protein-RNA complexes. We also consider its 
application to few protein-RNA complexes with available 
structural data for bound state as well as unbound 
components. 



H ' ~~ %H 
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Figure 1. Coarse-grained representation for nucleotides. Beads (dashed 
circles) are either centered on particular atoms or at geometric centers 
of a few atoms (dots). 



METHODS 

Coarse-grained representation 

The presented potential for protein-RNA interactions is 
designed to be compatible with the earlier coarse-grained 
parametrization developed for proteins by Zacharias 
(28,29). In Zacharias model, each amino acid is repre- 
sented by up to four pseudoatoms (beads): two corres- 
ponding to main chain nitrogen and oxygen, and one or 
two describing short and long side chains, respectively. In 
total, there are 31 pseudoatom types. To extend this model 
for nucleotides, 17 new bead types were introduced. They 
include three pseudoatoms for phosphate/ribose part, and 
three or four for purine and pyrimidine bases, respectively 
(Figure 1). 

The assumed, pairwise additive interactions between 
protein and RNA beads are described by a distance- 
dependent potential that has two different forms, corres- 
ponding to attractive and repulsive interactions (Figure 2). 
The attractive potential is of Lennard-Jones type, with a 
soft repulsive term: 



Uf\r) = eg 




Figure 2. Repulsive and attractive potential form. >■'" and U'": the 
position and value of C/ attr minimum. 



Pairwise-specific parameters <jy and e,- ; govern interaction 
range and strength, respectively. The repulsive potential is 
defined as: 



Ufttf + lUg forr<^' 



-Vf*(r) 



for r > ?' 



(2) 



(1) where r"j and Vy correspond to the position and value of 
C/?" r minimum. Such formula provides a smooth, easily 
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implemented potential, albeit with non-physically vanish- 
ing force at r m . Unlike in the Zacharias model for proteins, 
no separate electrostatic terms were introduced, thus 
assuming that the above potentials account for all effect- 
ive intermolecular interactions. 

Potential parametrization 

The interaction of each pair ij of protein and RNA beads 
is described by two parameters (cr,y and e,y), hence giving in 
total 1054 parameters for protein-RNA force field. They 
were derived in a knowledge-based manner, using a set of 
protein-RNA crystallographic complexes. At first, 
distance-dependent statistical potentials Gy(d) were con- 
structed for each bead pair, and the initial values of a 
and e parameters were obtained by fitting Equations (1) 
and (2) to G,j(d). The resulting parameter values were sub- 
sequently adjusted to optimize docking results in terms of 
finding the right (close to native) binding mode and its 
proper scoring. The details of parametrization procedure 
are given in the following. 

'Protein-RNA complexes' were selected from crystallo- 
graphic structures used recently for the analysis of 
protein-RNA binding sites (22,23). The provided lists of 
non-redundant complexes, having resolution better than 3 
A, were merged together, and a search for homologous 
structures was performed using Smith and Waterman al- 
gorithm (30) for proteins, with the similarity threshold of 
70% sequence identity, and nucleotide BLAST algorithm 
(31) for RNA, with the similarity threshold of 80% 
sequence identity. Two complexes were deemed redun- 
dant, if similarity thresholds were simultaneously 
exceeded by their both binding partners, thus allowing 
single binding component to be considered for docking 
with different partners. A list of complexes sharing one 
similar binding component is given in the Supplementaty 
Data. 

The non-redundant structures were than subjected to 
manual analysis with the following rules: 

• complexes in which a DNA molecule was found to 
bind protein and RNA molecules were discarded; 

• structures with multiple missing side chains or nucleo- 
tides were discarded; 

• structures with protein-RNA contact involving <5nt 
were discarded; 

• structures in which one of the binding partners was 
ribosome were excluded from the optimization of e 
parameters (see below) and from the test set for 
docking; and 

• homopolymers were identified and their multiplicity 
(«) was taken into account in the derivation of statis- 
tical potentials: polymers were analyzed as monomers, 
but their contribution to bead pair statistics was 
counted with weight l/«. 

All protein chains were processed with pdb2pqr 
program (32) in order to standardize atomic names and, 
if possible, reconstruct missing chemical groups. Complex 
geometry was left as submitted in the crystallographic 
structure. Finally, all structures were converted to 
coarse-grained representation. 



The above procedure resulted in 109 protein-RNA 
complexes, including 20 structures with ribosomal RNA. 
Out of this group, 84 structures, including all ribosomal 
complexes, were used for the statistical potential deriv- 
ation and the optimization of cj parameters, and 64 of 
them (non-ribosomal) were used for the optimization of 
e parameters. The remaining 25 non-ribosomal complexes 
were assigned to the test set. 

'Statistical potentials' for protein-RNA interactions 
were derived, using the following formula: 



Gdd) 



-kT\n 



NobsjiJ, d) 
N exp (i,j, d) 



(3) 



Here, N obs and N exp represent the number of observed and 
expected occurrences of a given ij bead pair at a distance 
d ± 8. In the calculations, d spacing of 0.5 A was used, 
with 5 of 0.25 A, and a distance range up to 14 A was 
considered. While obtaining 7V obs for the known atomic 
coordinates of protein-RNA complexes is rather straight- 
forward, N exp requires more attention. It should corres- 
pond to the number of observations expected for the 
random, interaction-independent placements of protein 
with respect to RNA, given their average structural 
features and (pseudo)atomic composition. 



N exp was defined based on geometric considerations as 



N, 



expO',./, d) 



Xi(d) X j{d)NTot(d). 



(4) 



Here, N Tot (d) corresponds to the total number of pairs 
found with distance d between beads, while Xiid), and 
analogously Xj(d), is a probability that any bead can be 
placed at a distance d from the bead of type i, given 
average structural features of protein (or RNA) molecules. 

Xi(d) values are obtained as the i-th fraction of the total 
accessible surface of radius d, S(d), that can be constructed 
for protein (or separately RNA) structures extracted from 
the available complexes: 



Xiid) 



Sjd) 
S(d) 



, where S(d) = ]T Si(d) 



(5) 



S,(d) is the contribution to the total accessible surface area 
due to all surfaces of type i beads. Those surfaces are 
defined as points accessible to the center of a probe of 
radius R P , remaining at a distance d from the bead of 
interest (Figure 3). 



Si(r,d) 




Figure 3. The accessible surface of radius d for a bead of type ;' located 
at r. R p is a probe radius. Si(d) = J2jSi(r,d). 
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The beads radii were estimated using the acquired 
N 0 bs(i, j, d), as an average of half distance of closest 
approach for all considered interaction partners for a 
given bead. The probe radius, in turn, was defined as an 
average of all bead radii, yielding R p = 1.78 A. The nu- 
merical results for x(d) appeared to be insensitive for 
small (~0.1 A) variations in the applied radii. 

Parameters optimization. The initial parameter set was 
obtained by fitting Equations (1) and (2) to the derived 
statistical potentials and choosing the (a, e) pair providing 
the lowest root mean square deviation (RMSD) from 
Gij(d). In order to optimize the performance of the 
obtained parameter set for protein-RNA docking, the 
parameters were further adjusted using a two-stage 
procedure. 

In the first stage, only a parameters were optimized to 
provide possibly best stability of native structures. Due to 
large dimensionality of parameter space, a random, 
Monte Carlo-like optimization scheme was introduced. 
The following optimization block was iterated until no 
further improvement was observed (In practical applica- 
tion, steps (3) and (4) were modified for better efficacy: the 
search started from a given a value, proceeded in positive 
and than negative direction, but each search direction was 
interrupted before checking all k values as only the score 
started to decrease.): 

(1) start with the current parameter set; 

(2) randomly select one a parameter from those that 
were not yet selected; 

(3) for each of 2k + 1 values: a — kA a , . . , a, . . , a + kA a 
perform potential energy minimizations for the set of 
native complexes, and score the results; 

(4) keep the a value that provides the best score; 

(5) if some a parameters remain to be scanned, go to 
point 2 

The energy minimization was done in protein (ligand) 
translational and rotational degrees of freedom, with 
both binding partners kept rigid. The scoring performed 
in point (3) was based on the following criteria applied to 
each minimized complex: 

• if alpha carbon RMSD of minimized versus native 
protein position was greater than >5A, a 'large 
penalty' (LP) was applied; and 

• if RMSD was between 1.0 and 5.0 A, its square was 
accumulated in Ermsd variable. 

The best score was considered to have the lowest number 
of LP or smallest Srmsd value among the scores with the 
same number of LP. Such scoring scheme was tuned for 
finding solutions close to native (with RMSD <5 A) with 
possibly low RMSD, but deeming unimportant RMSD 
variations below 1.0 A. In practical application, k = 6 
was chosen for each optimization round, and A a value 
was set as 0.1/iVA, where TV corresponded to the optimiza- 
tion round number. 

The second optimization stage involved the adjustments 
of e parameters to enhance scoring of native-like 
complexes. In order to evaluate the scoring efficiency, a 



set of 200 decoys was introduced for each considered 
protein-RNA complex. They were obtained as low 
energy solutions of systematic docking run whose 
RMSD from native structure was >5A. Such systematic 
docking was performed from starting points evenly 
distributed over the receptor surface, hence the loca- 
tions of the decoys were not limited to the area of 
binding interface (see 'Docking protocols' section 
below). 

A following optimization scheme was repeated until no 
further improvement in the number of correctly ranked 
complexes was observed: 

(1) start with the current parameter set and perform 
energy minimization for native complexes and all 
decoys; 

(2) randomly select one e parameter from those that 
were not yet selected; 

(3) for each of 2A:+1 values: e — kA e ,..,e,..,e + kA e 
perform a single point energy evaluation for all 
structures; 

(4) keep the e value that provides the best ranking of 
native complexes; 

(5) if some e parameters remain to be scanned, go to 
point (2). 

The criterion for the best overall ranking was based on 
the ability to provide the lowest energy for native-like 
(energy minimized) complexes with respect to their corres- 
ponding decoys. The best e value was regarded as having 
the highest number of properly ranked (i.e. with rank 1) 
native-like complexes. If a few e values resulted in identical 
ranking, the one with lowest penalty score was then 
chosen. The penalty score was calculated for complexes 
that did not rank properly. It depended on the RMSD 
of native-like complex (native complexes with high 
RMSD after energy minimization were not expected to 
be ranked as precisely as complexes with low RMSD), 
the actual rank of native-like complex among its decoys 
(the higher the rank the smaller penalty) and the difference 
in energy between the native-like complex and the best 
ranked decoy. For the e optimization procedure, k = 2 
and A e = 0.1 kcal/mol were chosen. For further details, 
see the Supplementary Data. 

Parameter set evaluation. The random optimization 
scheme does not guarantee reaching single, globally best 
parameter set, even if (unlikely) such set exists. In order to 
obtain insight into the validity of the adopted optimiza- 
tion procedure, three independent parameters optimiza- 
tions were carried out. The three resulting interaction 
potentials for each bead pair were compared with the 
analysis of standard deviations of their minima or saddle 
point positions with respect to their mean location. 
Minimum or saddle point position is a simple function 
of a parameter (on the distance scale) and e parameter 
(on the energy scale). In order to combine the deviations 
in a and e components, they were normalized by the re- 
spective average a or e deviations over all bead pairs. Such 
normalized deviations, S a and 5 e , were then combined, 
yielding a sing le value S m = JS 2 a + S%, measuring the 
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variability of a given bead pair interaction across the three 
sets. 

Finally, an average, consensus parameter set was con- 
structed, and docking efficacy of all four potentials was 
compared. 

Docking protocols 

All docking simulations performed in the current study 
were carried out with the ATTRACT program (28). 
They involved two rigid partners in coarse-grained repre- 
sentation: an RNA structure treated as an immobile 
receptor and a protein structure treated as a movable 
ligand. No additional information such as distance con- 
straints or binding region location was submitted. Prior to 
docking, a set of starting ligand positions evenly 
distributed with spacing of ~ 12 A around the receptor at 
distances precluding receptor-ligand overlaps was 
generated. For each such position, 208 initial ligand orien- 
tations were considered. A single docking attempt con- 
sisted of five stages of potential energy minimization in 
ligand translational and rotational degrees of freedom, 
with decreasing distance cutoff for pairwise interactions. 
The initial cutoff was set to 50 A and final cutoff was set to 
be 8 A. 

The solutions with converged potential energy were 
then clustered according to their pairwise RMSD dis- 
tances in order to remove redundant ligand poses. 
Scoring of the results was done according to their poten- 
tial energy. 

RESULTS AND DISCUSSION 

Potential parameters 

The obtained statistical potentials provide a valuable char- 
acterization of the assumed interaction model (see 
Supplementary Data for the corresponding plots): they 
reasonably indicate attractive and repulsive pseudoatom 
pairs, define the distances of closest approach and give the 
estimate of interaction strength. 

Their use as a definite and only basis for deriving inter- 
action potentials in the form of Equations (1) or (2) is, 
however, limited. First, due to a relatively small number of 
available, high-quality protein-RNA complexes, for some 
particularly infrequent bead pairs, considerable unphysic- 
al oscillations in the statistical potential are observed, ren- 
dering the fitting procedure inaccurate. Second, in some 
cases the statistical potential does not reach the zero 
baseline for distances as large as 14 A, requiring an arbi- 
trary decision whether the potential should be shifted 
before fitting, and if yes, what offset value should be 
used. Finally, the assumption that a statistical potential 
between two beads describes their pairwise interaction in 
the context of specific macromolecular environment is not 
physically justified (33). 

Due to these reasons, the statistical potentials were used 
to generate only an initial guess for the parameter set 
(called 'set 0' in the following), with its subsequent opti- 
mization, specifically for docking and scoring application, 
in mind. Given the large dimensionality of parameter 
space and relatively limited size of the training set, it is 



reasonable to expect that the result of parameter optimiza- 
tion would be sensitive to the order of pairwise potentials 
adjustments. As no indisputably superior adjustment 
sequence was determined, a random scheme was 
adopted and three independent optimization procedures 
were carried out. In addition, a consensus parameter set 
was constructed, obtained by averaging of respective a 
and e parameters of all three sets. 

Pairwise interactions amplitudes — well depths (negative 
values) for attractive potentials or saddle point heights 
(positive values) for repulsive potentials — for the consen- 
sus set (Figure 4), agree, in general, with the expected 
protein-RNA interaction characteristics. If the percentage 
of strong interactions (defined as having an amplitude 
higher than average repulsive or attractive interaction 
plus their respective one standard deviation) offered by a 
given bead is a measure of its importance for the specifi- 
city of protein-RNA recognition, the aromatic protein 
side chains (Tyr, Trp, Phe, His) appear to be the most 
contributing group, in line with the traditional view (34). 
They preferentially interact with nucleic bases rather than 
RNA backbone, which reflects stacking interactions they 
are supposed to make. 

Strong, mostly attractive interactions are observed for 
positively charged arginine. They are directed predomin- 
antly toward nucleic bases, at least if its distal, ARGS, 
bead is concerned. Interestingly, a similarly charged 
lysine appears to favor the RNA backbone instead, 
while the rest of its interactions is only moderate, with 
the exception of strong attraction to GG4 guanine 
pseudoatom. As expected, strong attraction is also 
observed between protein backbone nitrogen and 
pseudoatoms representing hydrogen bonds acceptors on 
nucleic bases. 

The repulsive interactions, while not driving the 
complex formation, are also important for binding speci- 
ficity. A significant repulsion is observed between nega- 
tively charged RNA phosphate groups and ionized side 
chains of Glu and Asp, and also for Gin, Asn and 
protein backbone oxygen. Interestingly, Cys appears to 
be in general the most RNA-repelling amino acid, 
both when phosphate backbone and nucleic bases 
are concerned, even though it does not carry a net 
charge. 

The weakest interacting beads, in turn, belong to hydro- 
phobic residues (He, Leu, Val, Ala, Gly) on the protein 
side, and sugar moiety on the RNA side. 

As expected, the exact values of parameters for the 
optimized interaction potentials vary across all three par- 
ameter sets. An average deviation for pairwise interaction 
with respect to mean location of its potential minimum or 
saddle point is 0.07 A on distance scale (S a ) and 0.18 kcal/ 
mol on energy scale (S e ). The observed combined devi- 
ations (see 'Methods' section) differ considerably among 
all pairwise potentials (Figure 5). To some extent, they 
reflect a statistical uncertainty due to an absolute 
number of bead pairs used to derive a given potential: 
rarely observed pseudoatoms like those in Trp or Met 
tend to have greater variability in their interaction poten- 
tials, whereas frequent beads like those in RNA or protein 
backbone tend to contribute to more repetitive potentials. 
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Figure 4. Color map: average amplitudes — well depths (negative 
values) or saddle point heights (positive values) — of interaction poten- 
tials for all bead pairs. Side plots: the percentage of strong interactions 
for each bead; strong interactions are denned as having an amplitude 
higher than average plus 1 SD. 
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Figure 5. Deviations of minimum or saddle point locations for three 
sets of pairwise potentials with respect to consensus set. Side plots: 
average deviations for each pseudoatom type (points), relative 
frequencies of pseudoatoms found in interacting pairs (histograms). 
All data normalized such that average values equal one. 



Such trend is also visible for nucleic bases with most often 
encountered guanine having on average lower deviations 
than less frequent uracil and adenine. For some ubiqui- 
tous pseudoatoms, however, like those in Arg, Lys or 
protein backbone oxygen, average deviations, though 
still below the mean for all protein beads, remain relatively 
high. Interestingly, those pseudoatoms are among the 
group that seems to particularly contribute to protein- 
RNA recognition. Perhaps, having high impact on 
docking performance, they are also exceptionally sensitive 
to the order of parameter optimization, and each time 
their interactions are being optimized, the adjustment 
depends on the actual state of all other pairwise potentials. 
It indicates that there are many local optima in the inter- 
action parameters space, resulting in similar docking effi- 
ciency for the limited training set. In order to mitigate the 
effect of such random parameters overfitting toward the 
training set structures, a consensus parameter set was 



constructed by averaging all respective a and e values of 
the three optimized sets. 

Docking performance 

Upon completion of parameters derivation, a systematic 
docking search was carried out for all non-ribosomal 
complexes from the training set (64 in total) and the test 
set (25 in total). Docking performance was evaluated 
based on interface RMSD (iRMSD) relative to the 
native interface composed of protein-RNA bead pairs 
found within the cutoff distance of 8 A in the crystallo- 
graphic structure, and the fraction of established native 
contacts (f NC ) within such interface. A docked ligand 
pose was considered as a 'hit', with iRMSD <2 A and 
,/nc >0.3, or iRMSD >1 A with / NC >0.5. Such criteria 
are equivalent to 'hit' being 'high' or 'medium' quality 
solution according to the CAPRI challenge (35) 
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Table 1. Docking results for the initial parameter set (0), three optimized parameter sets (1, 2, 3) and the consensus set (C) 



Parameter set Training set Test set 







A'l 


Wio 




A',, 


N t 


N 10 




0 


88 (72) 


19 (19) 


23 (23) 


67 (50) 


92 (60) 


4 (4) 


8 (8) 


72 (48) 


1 


100 (86) 


50 (41) 


75 (59) 


86 (66) 


100 (84) 


36 (36) 


56 (56) 


92 (76) 


2 


100 (89) 


52 (42) 


81 (62) 


92 (70) 


100 (80) 


40 (40) 


64 (56) 


88 (76) 


3 


100 (84) 


62 (53) 


73 (58) 


86 (66) 


100 (84) 


40 (40) 


52 (52) 


88 (80) 


C 


100 (83) 


61 (53) 


80 (66) 


88 (70) 


100 (84) 


48 (44) 


76 (72) 


100 (84) 



Numbers correspond to the percentage of hits for crystallographic structure energy minimization (N s ), and for systematic search within best rank 
solutions (Ni), within 10 best solutions (N w ), and within all docked poses (A^n). Numbers in brackets correspond to solutions of 'high' quality (see 
text for definition). 



guidelines. Separately, the statistics for hits of only 'high' 
quality (i.e. with iRMSD <1.0 A and / NC >0.5) was 
determined. 

Before performing a systematic docking search, in order 
to evaluate forcefield ability to provide stable complexes 
consistent with experimental binding modes, potential 
energy minimization of crystallographic structures was 
carried out. For each parameter set, except the 
non-optimized set 0, all complexes, both in the training 
and test set (Table 1), were found to be stable in the sense 
that the converged energy minimum met the adopted 
criteria for 'hit'. Most of the the optimized geometries 
(~85%) corresponded to high-quality solutions. 
Interestingly, in some cases when crystallographic 
complex distortion exceeded the assumed tolerance for 
high-quality model, such solution was found during sys- 
tematic docking search, though not necessarily scored as 
the best. It is worth stressing that crystallographic struc- 
tures are usually computationally optimized with the use 
of atomic forcefields, and some residual strains are likely 
to be encountered when switching to a different set of 
interaction potentials. Hence, the imbalance of particular 
receptor-ligand geometry during rigid body energy mini- 
mization does not preclude the forcefield ability to provide 
energy minimum that corresponds to physically equivalent 
binding mode. 

The performance of parameter set obtained solely upon 
fitting to statistical potentials (set 0) in terms of providing 
stable native-like geometries, was surprisingly good, given 
that its parameters were not tuned in any way to achieve 
this task. Around 90% of complexes remained stable, and 
majority of the 'optimized' geometries corresponded to 
high-quality solutions, with no significant difference 
between the training and test set. 

The results of systematic docking searches are presented 
in Table 1. For optimized and consensus potentials, 
native-like solutions in training set complexes are found 
to have the highest rank (lowest potential energy) in 
50-62% cases, depending on parameter set used. The 
success rate increases significantly when native-like solu- 
tions are allowed to have up to 10th rank, and further 
increase in rank threshold brings only moderate improve- 
ment (Figure 6). For the test set, optimal ranking is 
achieved for 36^18% of structures, and the success rate 
tends to saturate at higher rank threshold (Figure 6). 
Interestingly, portions of generally found native-like 
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Figure 6. The percentage of hits at given rank threshold obtained for 
three optimized parameter sets and the consensus set. Lower set of 
dashed lines: the numbers of hits with rank 1, upper set of dashed 
lines: the numbers of generally found native-like solutions. 



solutions are similar for both training and test sets, 
indicating that the force field ability to score the results 
is more prone to overfitting than its ability to provide 
proper geometries. This can be understood, as binding 
geometries depend mostly on effective pseudoatomic 
radii, while energy values are additionally sensitive to 
amplitudes of interaction potentials. 

As expected, the performance of parameters from set 0 
is much worse compared with optimized and consensus 
potentials. While the fraction of generally found solutions 
is moderately good, in accordance with the number of 
stable structures found during crystallographic complex 
energy minimization, the ability of non-optimized poten- 
tials to score the results is rather poor. Indeed, the position 
of minimum (or the range of hard core bead-bead inter- 
action) in statistical potentials seems to be relatively 
reliable and error-resistant physical descriptor in 
contrast to interaction strength. As a result, set 0 poten- 
tials approximate shape complementarity but fail in 
providing a meaningful estimate of the binding free 
energy. 

The performance of the three optimized parameter sets 
is generally similar within training or, separately, test set 
structures. The consensus parameter set does not bring 
much difference with respect to training set, but clearly 
outperforms the three optimized potentials when test set 
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Table 2. Relative fractions (in %) for two types of docking errors 



Parameter set 


Training 


set 


Test set 




Ei 


E 2 


Ei 


E 2 


0 


60 


40 


71 


29 


1 


72 


28 


88 


12 


2 


84 


16 


80 


20 


3 


62 


38 


80 


20 


C 


68 


32 


100 


0 



training s. □ 
test s. □ 
estimate 



El: the hit is found but scores improperly; E2: the hit is not found at 
all during systematic search. 



structures are considered. It suggests, that parameter 
averaging might indeed have contributed to the the reduc- 
tion of inter parameter correlations arising from 
overfitting. 

If obtaining a native-like solution, as the one with the 
lowest potential energy, is considered as an ultimate goal 
of docking study, the most common reason for the lack of 
success is improper scoring (error of type 1, El). It 
accounts for ~70% of failures for the training set struc- 
tures and >80% for the test set structures (Table 2). The 
inability of systematic search procedure to find native-like 
binding geometry (error of type 2, E2), corresponds to 
16-38% of failures for the training set and up to 20% 
for the test set. As noted above, the actual percentage of 
structures for which systematic search fails to find a 
native-like geometry is similar for training and test sets 
(~10%), and differences in relative frequencies of El 
versus E2 errors for those sets result from generally 
worse scoring for the test set structures. Error distribution 
obtained with non-optimized parameter set 0, shows a 
greater proportion of E2 errors when compared with 
optimized and consensus parameter sets. Again, it indi- 
cates that the optimization of shape complementarity 
encoded in the potentials is more efficient than the opti- 
mization of their scoring capabilities. 

From the structural point of view on docking efficiency, 
the considered protein-RNA complexes tend to be either 
easy (docked and scored within top 10 for all four param- 
eter sets) or difficult (with rank worse than 10 for all four 
parameter sets). The fractions of complexes belonging to 
those two classes (Figure 7) are much higher than esti- 
mates based on the supposition that the success (or 
failure) in docking for one parameter set is independent 
from the results for other parameter sets, i.e. can be 
expected with probability being a product of success (or 
failure) probabilities in each case. Clearly, to some degree 
it is the effect of correlations between parameter sets, as 
they are derived from a common predecessor, but more 
importantly it is a consequence of structural properties of 
the investigated complexes. 

In most cases, complexes regarded as difficult have 
small interface region (see Figure 9 for examples: 2F8S, 
2DLC, 1 YYW). The number of interfacial beads on both 
receptor and ligand side for ~80% of difficult structures is 
within the lowest 20% of interfacial bead numbers for all 
complexes (Figure 7). Typically in such cases, a native-like 
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Figure 7. Left: the distribution of easy, medium and difficult docking 
structures, and an estimate of their relative frequencies assuming no 
correlation between docking efficiency and structural properties. 
Right: the distribution of structural features among difficult complexes; 
R IB, L IB — the number of interfacial beads in receptor and ligand, 
R S, L S — receptor and ligand accessible surface area. Features ranges 
are normalized with 0 and 100 corresponding to the lowest and highest 
observed values, respectively. 



geometry is found but scores badly, as alternative solu- 
tions provide more, albeit usually worse, contacts. On 
the contrary, in 2 of 11 difficult cases (1F7U, IFFY) the 
interface size is exceedingly large (close to the maximum of 
the observed interfacial bead number range). For those 
complexes, native-like solutions are not found at all 
during systematic search, as tightly packed, large 
binding region precludes successful docking of rigid 
partners, even though crystallographic structures corres- 
pond to stable energy minima. In general, the size of 
protein interface appears to be slightly more important 
for docking efficiency. Perhaps, it is due to greater contri- 
bution of protein partners to specific recognition (22), 
which is a consequence of greater variability among 
protein polymer building blocks in comparison with just 
four types of standard RNA nucleotides. 

The absolute size (accessible surface area) of binding 
partners seems to have little influence on docking effi- 
ciency. Structures of different sizes are in general evenly 
distributed among difficult and non-difficult complexes 
(Figure 7), with the exception of three largest ligands. 
Two of them, however, (2F8S, 1SER) appear to have 
rather small interface regions, while the third one 
(IFFY) was already mentioned as having extremely 
large and tight interface. 

Improving docking performance 

As noted above, proper scoring of docked geometries 
appears to be the major factor that affects docking per- 
formance. A possible way to increase scoring efficiency is 
to use information about the expected interface location in 
order to amplify ligand-receptor interactions originating 
from this area. To date, a number of methods have been 
developed for the prediction of RNA binding sites on 
proteins (18-23). Unfortunately, predicting protein 
binding sites on RNA is much more difficult task and, 
apart from general characterization of protein binding 
interface on RNA (13-17), to our knowledge no predictive 
methods exist to tackle this problem. Due to this reason, 
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Figure 8. Rescoring of 200 lowest energy solutions with protein inter- 
facial interactions scaled by weight W. N w jN\ — the ratio of best ranked 
hits for weight W to original result with W — 1. Sensitivity and speci- 
ficity refer to the prediction quality of protein interface region. Black 
dots indicate prediction quality of some available methods taken from 
Ref. (23). 



only the influence of protein interface predictions will be 
considered. 

Instead of evaluating the usefulness of particular 
approaches for the prediction of RNA binding region in 
proteins, a generic estimate of the predictive power 
required to improve scoring was carried out. The true 
protein interface was assumed to consist of beads 
involved in the formation of native contacts with RNA 
(see above for definition). Its status was perturbed to the 
desired level of specificity and selectivity by randomly 
introducing some false positive and false negative inter- 
facial beads to protein structure. Finally, for each such 
'prediction', two hundred lowest energy solutions of sys- 
tematic docking search for each complex were rescored 
with the interactions for ligand 'interfacial' beads scaled 
up by some weight factor (W). An alternative scheme, in 
which only attractive interactions were scaled, was also 
considered but the results were qualitatively the same 
and, hence, are not shown. Due to random character of 
beads reassignment, interface perturbation and rescoring 
of docking results were repeated five times for each pre- 
diction quality and weight factor combination, and 
average numbers of best ranked native-like geometries 
were recorded. 

Moderate scaling (W = 1.25) of the predicted native 
interactions appears to have in general a small beneficial 
effect on scoring efficiency (Figure 8). Even with relatively 
low prediction quality, it provides up to ~10% increase in 
top ranked hits. Greater weighting factors allow reaching 
better results, up to ~30% increase in optimal scores. This 
effect tends to saturate, however, and interaction scaling 
above W = 5 does not improve the results, even for ideally 



'predicted' interface (with sensitivity and selectivity of 1). 
This is understandable, as predictions affect only the 
protein binding partner and thus, false solutions, in 
which a correct protein region binds to an incorrect 
RNA region, can also benefit from interaction scaling. 

Interestingly, higher W values require better prediction 
quality in order to improve scoring efficiency. In fact, poor 
predictions tend to even worsen the original (W = 1) 
results, and this effect becomes increasingly evident with 
growing W. At the same time, high sensitivity appears to 
be somewhat more beneficial than specificity, suggesting 
that finding as many interfacial pseudoatoms as 
possible, even at the expense of false positive predic- 
tions, is favorable to capturing only few, but truly inter- 
facial beads. Indeed, in the latter case, some false 
solutions that partially overlap with the correct 
binding region may easily benefit from high W values, 
if only they involve contact through singular promoted 
interfacial beads. 

Confronting the above considerations with the expected 
effectiveness of currently available methods for RNA 
binding site prediction (Fig. 8), indicates that making 
use of their results would improve scoring efficiency by 
10-20% with moderate interaction scaling (W<5). For 
W = 5 and above, most methods seem to give prediction 
quality that may not be sufficient for such improvement, 
or may even deteriorate scoring. 

The inability of systematic search procedure to find 
native-like binding geometry (E2 error type) was the 
second reason for the lack of docking success. In some 
cases, like those with large and tight interface, relatively 
little can be done to avoid this type of failure within the 
framework of rigid structures docking. In other cases, with 
just narrow 'binding funnel' leading from starting points 
for energy minimization to a bound complex, an increase 
in the density of initial ligand placements should be of 
help. Indeed, a 2-fold increase of starting points density 
limits the fraction of complexes with no native-like 
solution from 12% to 5% for the training set (results for 
the consensus parameter set). Furthermore, three out of 
all five newly found hits score with top rank (and represent 
'high'-quality solutions), indicating that E2 errors occur 
indeed due to geometrical constraints rather than 
forcefield inability to provide a proper energy landscape. 
In the case of test set, there were no structures with E2 
error for the consensus parameter set, hence no improve- 
ment in this respect was observed, but one additional 
high-quality solution was found when the number of 
starting points was doubled. Certainly, an increase in the 
number of initial ligand placements results in a propor- 
tional increase in the time needed for systematic search, 
but such computational burden can be reduced by the in- 
clusion of information about the putative binding site 
location on RNA and appropriate starting point density 
modulation. 

Docking of unbound structures 

The docking of protein to RNA in their unbound con- 
formations represents scenario encountered in real-life ap- 
plications. On top of difficulties related to obtaining 
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proper ligand placement and its scoring, a significant 
obstacle arises due to local and global conformational 
changes that likely occur upon binding and need to be 
predicted for successful docking. Addressing this issue is 
beyond the scope of the present report, focused on the 
development of an intermolecular interaction potential. 
Nonetheless, the performance of rigid docking with the 
use of consensus parameter set was checked on few avail- 
able unbound structures of both binding partners in order 
to give an estimate for the level of structural deformation 
that still does not preclude successful docking. 



Unbound structures were required to have at least 95% 
sequence identity of the overlapping region with bound 
version, both in the case of RNA and proteins. As there 
are very few such instances, in the case of bound RNA in 
the form of straight double helix, for which exists an 
unbound protein partner, a modeling of 3D structure 
was performed using secondary structure as an input. 
The program Assemble (36) was used for this task. The 
criteria applied so far for the detection of 'hit 1 were 
extended to incorporate solutions of 'acceptable' quality 
according to CAPRI classification, that is having iRMSD 
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Table 3. Results for docking with unbound structures 



Complex 
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1Y0Q, (2.6) 
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2E9T iKO 
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Subscripts at PDB id denote involved chains, numbers in brackets correspond to RMSD (P atoms for receptor and Ca for ligand) versus bound 
structure (in case of multiple NMR models an average value is provided); br, bl, ur, ul — bound/unbound receptor/ligand; numbers correspond to the 
ranks of 'hits' or 'acceptable' solutions (in square brackets). 
Asterisk denote NMR structure. 



<4.0 A and 0.1 </ NC <0.3, or iRMSD >2.0 A and/ NC 
>0.3. 

Though the statistics based on just few presented cases 
is not particularly meaningful, it allows to divide the 
results in two classes. 

In cases when the protein partner binds on the surface 
of double RNA helix (Figure 9, 1DFU, 1JBR, 2AZ2, 
2RKJ), docking results appear to be more sensitive to 
the distortion on protein rather than RNA side 
(Table 3). RMSD up to 2.8 A on the RNA side did not 
preclude finding a hit or acceptable solution within the top 
20 solutions, when docking of protein in its bound con- 
formation was considered. At the same time, for bound 
RNA-unbound protein pair, there was no correspond- 
ingly high ranked solutions, even for relatively low 
protein RMSD. Such result may be expected, as in those 
cases specific binding occurs due to, generally movable, 
protein side chains, while the topology of RNA minor 
and major groves remains relatively unchanged. The sta- 
bility of helical RNA and rather loose binding geometries 
account also for quite successful docking of two partners 
in their unbound conformations: in all three cases, solu- 
tions were found within 100 top geometries. 

The situation is more difficult in the case of protein 
binding to single-stranded termini of RNA chains 
(Figure 9, 1TTT, 2E9T). Here, the distortion of 
unbound RNA allows for finding only poorly scored so- 
lutions of 'acceptable' quality. Also, as binding is gener- 
ally tighter than in previous cases } even small change in 
protein structure (RMSD of 0.8 A for 2E9T) gives un- 
favorable score for the hit found. In both cases, docking 
of two unbound partners was unsuccessful. 

The remaining complex (Figure 9, 1K8W) does not 
belong to the above classification. Here, binding involves 
a long RNA loop with most bases in flipped out geometry. 
The docking of unbound protein to bound RNA is sur- 
prisingly successful, but solutions for unbound RNA con- 
formation are not found. This, however, does not need to 
result from the changes in the geometry of binding site on 
RNA, but from the fact that unbound RNA represents a 
complete molecule (tRNA) whose fold outside the region 
overlapping with bound structure causes significant 
sterical clashes with bound protein, thus precluding any 
successful binding. Whether tRNA needs to refold upon 
binding, or the crystallographic structure of complex 



involving incomplete RNA does not represent real 
geometry, remains an open question. 



CONCLUDING REMARKS 

A new coarse-grained force field for protein-RNA inter- 
actions targeted on docking applications was presented. It 
is compatible with earlier representation developed for 
protein complexes and suitable to use within the 
ATTRACT docking protocol (28), based on intermolecu- 
lar potential energy minimization. The force field was 
parametrized and tested with the use of 110 crystallo- 
graphic structures of protein-RNA complexes. It 
showed a good performance in systematic docking 
search for protein-RNA binding geometries when two 
partners were considered in their bound conformations. 
The presented results were obtained without any addition- 
al constraints based on information regarding the putative 
binding site location; however, the possible role of such 
information in augmenting docking efficiency was 
demonstrated. 

The application to unbound structures showed 
promising results in the cases with loose binding inter- 
faces, for which moderate deformations of global struc- 
ture do not affect much the topology of native contacts. In 
the cases where complex formation involves local con- 
formational changes and the creation of sterically tight 
interpenetrating binding interfaces, rigid body docking 
was obviously unsuccessful, however, due to reasons not 
related to forcefield performance. 

Addressing the issue of global and local conformational 
changes in protein and RNA binding partners is a signifi- 
cant challenge, remaining within the scope of not only 
docking, but also folding community. An important 
problem, particularly affecting the development of 
docking methodologies, is the small amount of available 
structural data representing both the geometry of a 
complex and its components in free forms. The currently 
presented forcefield, providing a reasonable estimate of 
intermolecular interaction energy at low computational 
expense, is a good basis to be coupled with existing and 
future methods accounting for molecular deformability at 
different structural levels. Such combination, leading to 
flexible protein-RNA docking, is in the center of our 
ongoing efforts. 
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